docs: Add GEPA prompt optimization and human agreement specs #95

Open
vivian-xie-db wants to merge 2 commits into main from add_optmization_human_agreement_spec

Conversation

@vivian-xie-db
Collaborator

Summary

  • PROMPT_OPTIMIZATION_SPEC.md: Declarative spec for the GEPA prompt optimization pipeline — covers MLflow optimize_prompts API, training data from annotated traces, score normalization, predict_fn behavior, custom endpoint support, config persistence, auto-reconnect, and score improvement display.
  • HUMAN_AGREEMENT_SPEC.md: Declarative spec for GDPVal A^HH human-to-human agreement metric — covers the formula E[1 - |H_1 - H_2|], rating normalization (Likert/binary), pairwise agreement %, per-metric breakdowns, and IRR integration.
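
As a rough illustration of the A^HH metric summarized above, here is a minimal Python sketch. The linear Likert scaling, averaging over every rater pair on each item, and counting only exact matches for pairwise agreement % are assumptions made for this example and are not taken from the spec; all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def normalize_likert(rating, low=1, high=5):
    """Map a Likert rating on [low, high] to [0, 1] (assumed linear scaling)."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (rating - low) / (high - low)


def human_agreement(ratings_by_item):
    """Estimate E[1 - |H_1 - H_2|] over all rater pairs on each item.

    ratings_by_item holds one list of normalized ratings in [0, 1] per item.
    Items with fewer than two ratings contribute no pairs.
    """
    scores = [
        1 - abs(h1 - h2)
        for ratings in ratings_by_item
        for h1, h2 in combinations(ratings, 2)
    ]
    return mean(scores) if scores else None


def pairwise_agreement_pct(ratings_by_item):
    """Percentage of rater pairs that gave exactly the same rating (assumed definition)."""
    pairs = [
        (a, b)
        for ratings in ratings_by_item
        for a, b in combinations(ratings, 2)
    ]
    if not pairs:
        return None
    return 100 * sum(a == b for a, b in pairs) / len(pairs)


# Three items, each rated by two humans on a 1-5 Likert scale.
raw = [[5, 5], [4, 2], [1, 5]]
normalized = [[normalize_likert(r) for r in item] for item in raw]
a_hh = human_agreement(normalized)
pct = pairwise_agreement_pct(raw)
```

Binary ratings fit the same functions if they are encoded as 0 and 1, since they are already on the [0, 1] scale.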

Test plan

  • Specs are well-formed markdown and render correctly on GitHub
  • Content matches current implementation behavior

🤖 Generated with Claude Code

Add declarative specifications for two key features:
- PROMPT_OPTIMIZATION_SPEC.md: GEPA optimizer pipeline, training data,
  score normalization, predict_fn, config persistence, auto-reconnect
- HUMAN_AGREEMENT_SPEC.md: GDPVal A^HH human-to-human agreement metric,
  rating normalization, pairwise agreement %, IRR integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@FMurray

FMurray commented Feb 13, 2026

IRR metrics are currently crammed into the Judge evaluation spec, where they probably don't belong. We should move the IRR section out and merge it into the human agreement spec you've added here.

Comment thread specs/HUMAN_AGREEMENT_SPEC.md Outdated
|--------|-----------------|-----------------|
| **GDPVal A^HH** (this spec) | Human vs Human agreement | IRR Results page |
| Pairwise Agreement % | Human vs Human agreement (percentage) | IRR Results page |
| Cohen's Kappa | Judge vs Human agreement | Judge Tuning page |

GDPval also uses A^HA, right? Why is Cohen's Kappa still used?

Collaborator Author

Cohen's Kappa is on the Judge Tuning page. A small section above the evaluation results shows the Cohen's Kappa score, and it has been there since version 1.0.
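
For readers comparing the two metrics, the standard Cohen's Kappa formula, kappa = (p_o - p_e) / (1 - p_e), can be sketched in Python as below. This is the textbook definition applied to paired judge and human labels; it is not taken from the app's code, and the function name and the handling of the degenerate p_e = 1 case are assumptions.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(judge_labels)

    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected chance agreement from each rater's marginal label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    categories = judge_counts.keys() | human_counts.keys()
    p_e = sum(judge_counts[c] * human_counts[c] for c in categories) / (n * n)

    if p_e == 1:
        # Both raters used a single identical label everywhere (assumed convention).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Judge and human agree on 3 of 4 binary labels.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Unlike the raw agreement measures, Kappa subtracts the agreement expected by chance, which is why the two can diverge on skewed label distributions.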

…el in judge evaluation spec

Remove the generic IRR section (Krippendorff's Alpha, Cohen's Kappa for
rater pairs) from JUDGE_EVALUATION_SPEC and replace with a detailed
Cohen's Kappa Metrics Panel spec covering the judge-vs-human agreement
metrics displayed after evaluation on the Judge Tuning page.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 participants